1 Question 1

1.1 Loading in Data

# Packages used throughout: dplyr (pipes), kableExtra (tables),
# factoextra (fviz_* plots, get_dist), janitor (tabyl)
library(dplyr); library(kableExtra); library(factoextra); library(janitor)
q1 <- read.table("~/Downloads/commu_crime_na_filled.txt", header = TRUE)
q1 <- q1[, -90]
q1.names <- read.csv("~/Downloads/communities.data", header = FALSE)
head(q1)

1.2 Hierarchical Clustering

Our first method will be hierarchical clustering. I will set a seed (4) for this and future manipulations to ensure consistency when this .Rmd publishes.

The chunk below scales the data, creates a distance matrix, and then creates a dendrogram using complete linkage.

The code included will also create objects containing the hclust object and the assigned cluster for each city in the data set. I chose to cut the dendrogram at a height of 4.2, which creates four clusters.

set.seed(4)
sd.data <- scale(q1)
data.dist <- dist(sd.data)  # distances on the scaled data
plot(hclust(data.dist), main = "Complete Linkage", labels = q1.names[, 4],
     xlab = "", sub = "", ylab = "", cex = 0.5)
abline(h = 4.2, col = "red")

hc.out <- hclust(data.dist)
hc.clusters <- cutree(hc.out, k = 4)  # the four clusters below the h = 4.2 cut

1.3 K-means Clustering

My next method will be k-means clustering. With this method, we have to choose the number of clusters up front. Given that hierarchical clustering produced four clusters, it seems appropriate to create four clusters with the other methods as well (where possible).
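Since k must be chosen up front, one common sanity check (not part of the original analysis) is an elbow plot of the total within-cluster sum of squares across candidate values of k. A minimal sketch on simulated data standing in for the scaled crime data:

```r
# Elbow heuristic sketch on toy data (a hypothetical stand-in, not q1):
# total within-cluster SS drops sharply until the "true" k, then flattens.
set.seed(4)
centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
toy <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))))
wss <- sapply(1:8, function(k) kmeans(toy, k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The same loop run on sd.data would give an empirical check on whether four clusters is a reasonable choice here.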

1.3.1 PCA for Question 1 - Axes for K-means Clustering Plot

Given the multi-dimensionality of the data, it makes sense to perform principal component analysis (PCA) on the data set to make it easier to display the clusters visually.

The code below will perform PCA on the data set. It will also tell us a fair amount of information about the resultant principal components.

set.seed(4)
pr.out=prcomp(q1, scale=TRUE)
pr.var=pr.out$sdev^2
pve=pr.var/sum(pr.var)
pca.names<- 1:89
head(data.frame(pca.names, pve*100),10)%>% 
  mutate(across(where(is.numeric), ~ round(., 1)))%>%
   kbl( 
    caption = "<center><strong>PCA Variance Explanation</strong></center>",
      col.names = c("PC #",
                    "Variance Explained (%)")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
PCA Variance Explanation
PC # Variance Explained (%)
1 28.1
2 17.1
3 8.8
4 7.4
5 4.2
6 3.6
7 3.2
8 2.2
9 1.7
10 1.7
fviz_eig(pr.out)

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')

So, we can see from the plot above that the first principal component accounts for ~28% of the variability between cities in the dataset, the second accounts for 17.1%, the third 8.8%, and so on.

The last plot shows how the contributions of the individual principal components accumulate to account for the variance in the data set.
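One practical use of that cumulative curve: pick the smallest number of components that reaches a target share of variance. A quick sketch with made-up proportions (the real values come from the pve object above):

```r
# Toy pve values (hypothetical); the same one-liner works on the real pve.
pve_toy <- c(0.28, 0.17, 0.09, 0.07, 0.05, 0.04)
cumsum(pve_toy)                     # running share of variance explained
which(cumsum(pve_toy) >= 0.60)[1]   # smallest PC count reaching 60% -> 4
```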

Now, let’s get to k-means clustering.

The code below will create k-means clusters out of our data and create plots to display how the cities in the data set were clustered with this method. The first plot shows the data in two dimensions plotted against the first two principal components, and the second plot shows the data plotted against the first three principal components.

set.seed(4)
km.out =kmeans(q1, 4, nstart = 20)
fviz_cluster(km.out, data = q1,
            # palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw()
             )

comp<-data.frame(pr.out$x[,1:3])
library(plotly)
plot_ly(x=comp$PC1, y=comp$PC2, z=comp$PC3, type="scatter3d", mode="markers", color=km.out$cluster)

Make note of this pattern, because I’ll display the results of our clustering methods in this way so that we can make a comparison.

For starters, I’ll show the comparable plot for the hierarchical clusters we created previously on the first two and three principal components to see if there are differences:

fviz_cluster(list(data = q1, cluster=hc.clusters), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             geom = "point",
             palette = "jco",
             ggtheme = theme_minimal())

plot_ly(x=comp$PC1, y=comp$PC2, z=comp$PC3, type="scatter3d", mode="markers", color=hc.clusters)

There are similarities, but also clear differences.

Note the differences in the shapes of the clusters. One of the most notable differences is that in k-means clustering, cities with PC2 < -5 and PC1 < 5 are almost entirely clustered together. With hierarchical clustering, there is a cluster that contains almost all of the cities with a PC1 value less than -5.

Now let’s move on to our last clustering method.

1.4 Fuzzy Clustering

Soft or fuzzy clustering is a form of clustering in which each data point is partially assigned to more than one cluster depending on its features. A membership grade is assigned to each data point for each cluster, reflecting the degree to which the point belongs to that cluster. Points on the edge of a cluster, with lower membership grades, belong to it to a lesser degree than points in its center. Essentially, this leaves room for grey area in how well a point 'fits'. Hard clustering, by contrast, forces each data point into a single cluster, ignoring features that might fit well in others.
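To make "membership grades" concrete, here is a base-R sketch of the standard fuzzy c-means membership formula (fuzzifier m = 2) on toy points with fixed toy centers; cmeans() below estimates centers and memberships like these iteratively:

```r
# Toy points and centers (hypothetical, not the crime data).
set.seed(4)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
centers <- rbind(c(0, 0), c(5, 5))
# Point-to-center distances, then the FCM membership formula for m = 2:
# u[i, k] = (1 / d[i, k]^2) / sum_j (1 / d[i, j]^2)
d <- as.matrix(dist(rbind(centers, pts)))[-(1:2), 1:2]
u <- (1 / d^2) / rowSums(1 / d^2)
round(head(u, 3), 2)   # grades for the first three points
rowSums(u)             # each row sums to 1
```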

The code below will use cmeans() to create fuzzy clusters in the data set. I use corrplot() to show the first 100 data points and how well each fits into the four clusters. The darker the dot, the more that data point belongs in the indicated cluster.

set.seed(4)
library(e1071)
fuzz<-cmeans(q1, 4)

library(corrplot)
par(mfrow=c(1,4))
corrplot(head(fuzz$membership,25), is.corr = FALSE)
corrplot(fuzz$membership[26:50,], is.corr = FALSE)
corrplot(fuzz$membership[51:75,], is.corr = FALSE)
corrplot(fuzz$membership[76:100,], is.corr = FALSE)

par(mfrow=c(1,1))

Below are the 2D and 3D plots of the data assigned to their respective soft cluster plotted on the principal components we created. Note that there are clear differences between these and the clusters we created using k-means and hierarchical clustering.

fviz_cluster(list(data = q1, cluster=fuzz$cluster), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             geom = "point",
             palette = "jco",
             ggtheme = theme_bw())

plot_ly(x=comp$PC1, y=comp$PC2, z=comp$PC3, type="scatter3d", mode="markers", color=fuzz$cluster)

1.5 Clusters by Method

Okay, now let’s compare the patterns of how data points were assigned to clusters using each of the methods. The table below includes the first 25 cities in the data set and their assigned cluster using each method.

What is important here is NOT which numerical cluster each city was assigned to, but the pattern of which cities were clustered together. Looking at the table, it is apparent that some cities were clustered together by one or two methods, but not the others. This reinforces that the methods differ in how they cluster the data points.

clust<-data.frame(q1.names[,4], km.out$cluster, hc.clusters, fuzz$cluster)
head(clust,25)%>%
  kbl( 
    caption = "<center><strong>Clustering by Method</strong></center>",
      col.names = c("","K-Means Clustering",
                    "Hierarchical Clustering",
                    "Fuzzy Clustering")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Clustering by Method
K-Means Clustering Hierarchical Clustering Fuzzy Clustering
Lakewoodcity 3 1 3
Tukwilacity 3 2 3
Aberdeentown 2 1 2
Willingborotownship 4 3 4
Bethlehemtownship 3 1 4
SouthPasadenacity 4 3 4
Lincolntown 3 1 4
Selmacity 1 4 2
Hendersoncity 2 4 2
Claytoncity 4 3 4
DalyCitycity 1 2 4
RockvilleCentrevillage 4 3 4
Needhamtown 4 3 4
GrandChutetown 3 1 4
DanaPointcity 4 3 4
FortDodgecity 2 1 2
Albanycity 2 4 2
Denvilletownship 4 3 4
Valparaisocity 3 1 3
Rostravertownship 3 1 2
Modestocity 1 2 3
Jacksonvillecity 2 4 2
KlamathFallscity 2 1 2
SiouxCitycity 2 1 2
Delanocity 2 4 2
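Pairwise patterns like these can also be summarized in a single number. The Rand index - the share of point pairs that two partitions treat the same way (together in both, or apart in both) - is label-invariant, so it sidesteps the cluster-numbering issue noted above. A small base-R sketch on toy label vectors (not the real assignments):

```r
# Rand index: fraction of point pairs on which two clusterings agree.
rand_index <- function(a, b) {
  same_a <- outer(a, a, `==`)   # pair together in clustering a?
  same_b <- outer(b, b, `==`)   # pair together in clustering b?
  agree  <- same_a == same_b
  mean(agree[upper.tri(agree)])
}
rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same partition, relabeled -> 1
rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))  # crossed partition -> 1/3
```

Running rand_index() on each pair of columns in clust would quantify how much the three methods actually agree.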

Last, we can see in the tabyl below that, within each soft cluster, cities were assigned differently by the k-means and hierarchical methods. The heterogeneity in the numbers of cities assigned to clusters (note the totals) again speaks to differences in the resultant clusters across methods.

clust%>%
  select(km.out.cluster,hc.clusters, fuzz.cluster)%>%
 rename("K-Means"= km.out.cluster,"Hierarchical"= hc.clusters, "Fuzzy" = fuzz.cluster )%>%
    tabyl(`K-Means`, Hierarchical, Fuzzy)%>%
    adorn_totals(where = c("row", "col")) %>%
    adorn_title(placement = "combined") %>%
  kbl(caption="<center><strong>Clustering by Method (Fuzzy Cluster 1-4 Displayed from Left to Right)</strong></center>",  longtable = TRUE)%>%
 kable_paper(html_font = "Cambria", latex_options="scale_down")
Clustering by Method (Fuzzy Cluster 1-4 Displayed from Left to Right)
K-Means/Hierarchical 1 2 3 4 Total
1 0 4 0 1 5
2 0 0 0 0 0
3 5 3 0 2 10
4 0 0 0 0 0
Total 5 7 0 3 15
K-Means/Hierarchical 1 2 3 4 Total
1 0 4 5 92 101
2 65 17 2 540 624
3 81 18 0 29 128
4 0 0 0 0 0
Total 146 39 7 661 853
K-Means/Hierarchical 1 2 3 4 Total
1 0 50 8 14 72
2 0 0 0 2 2
3 155 86 9 31 281
4 0 0 0 0 0
Total 155 136 17 47 355
K-Means/Hierarchical 1 2 3 4 Total
1 0 32 22 0 54
2 0 0 0 0 0
3 213 67 23 3 306
4 56 30 324 1 411
Total 269 129 369 4 771

2 Question 2

2.1 Loading in Data

q2 <- read.csv("~/Downloads/Ch10Ex11.csv", header = FALSE)
q2 <- t(q2)  # transpose so rows are the 40 patients, columns the 1,000 genes
dim(q2)

2.2 PCA for Question 2

Let’s start by performing PCA on the data. The code below creates objects out of the PCA analysis on the data set, the variance explained by each principal component, and the percentage of the variance in the whole data set that each principal component explains.

set.seed(4)
pr.out=prcomp(q2, scale=TRUE)
pr.var=pr.out$sdev^2
pve=pr.var/sum(pr.var)

Now let’s take a look at these ideas visually. The first plot below shows the proportion of variance in the data set that the first ten principal components explain. This information is also shown in a table for ease of reading.

pca.names<- 1:40
fviz_eig(pr.out)

head(data.frame(pca.names, pve*100),10)%>% 
  mutate(across(where(is.numeric), ~ round(., 1)))%>%
   kbl( 
    caption = "<center><strong>PCA Variance Explanation</strong></center>",
      col.names = c("PC #",
                    "Variance Explained (%)")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
PCA Variance Explanation
PC # Variance Explained (%)
1 8.1
2 3.4
3 3.3
4 3.2
5 3.1
6 3.0
7 2.9
8 2.9
9 2.9
10 2.9

Last, the code below will plot each patient against the first two principal components. I see a clear pattern here - don’t you? It looks like there is a wide gap on the first principal component axis between the first twenty patients (healthy) and the second twenty patients (diseased).

fviz_pca_ind(pr.out, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE 
             )

2.3 Clustering Question 2 - Hierarchical Clustering

Now let’s move on to hierarchical clustering.

2.3.1 Correlation-Based Distance
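Correlation-based distance treats two observations as close when their profiles move together, regardless of magnitude: it is conventionally 1 minus the Pearson correlation between rows, which is, to my understanding, what get_dist(..., method = "pearson") computes. A base-R sketch on toy rows:

```r
# Toy expression-like rows (hypothetical).
toy <- rbind(a = c(1, 2, 3, 4),
             b = c(2, 4, 6, 8),   # same profile as a, doubled
             c = c(4, 3, 2, 1))   # reversed profile
as.dist(1 - cor(t(toy)))  # a-b distance 0, a-c distance 2
```

This is why correlation-based distance suits gene expression: rows a and b have very different magnitudes but identical shape, so they are treated as identical.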

The code below scales the data and creates an object containing the correlation-based distance before creating a dendrogram.

comp<-data.frame(pr.out$x[,1:4])

set.seed(4)
sd.data =scale(q2)
data.dist = get_dist(sd.data, method="pearson")

plot(hclust(data.dist, method = "complete"),  # method is an hclust() argument, not plot()'s
     main = "Complete Linkage", xlab = "", sub = "", ylab = "", cex = 0.5)
abline(h = 1.09, col = "red")

We can see that the first twenty patients and the last twenty patients fall into separate clusters, which merge at a height of ~1.09. Interesting that there are differences between healthy and diseased patients…

Just to verify this notion, let’s check which patients ended up in which cluster. I’ve created a variable indicating whether each patient is healthy or diseased and placed it alongside the assigned cluster.

hc.out =hclust(data.dist)
hc.clusters = cutree(hc.out, 2)

pt.type<-data.frame(row(q2)[,1],hc.clusters)%>%
  mutate(patient = case_when(
    (row_number() %in% 1:20)~ "Healthy",
    TRUE~ "Diseased"
  ))

pt.type%>% 
  select(patient, hc.clusters)%>%
   kbl( 
    caption = "<center><strong>Healthy versus Diseased Clustering</strong></center>",
      col.names = c("Patient Type",
                    "Cluster")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Healthy versus Diseased Clustering
Patient Type Cluster
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Healthy 1
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
Diseased 2
table(pt.type$patient, pt.type$hc.clusters)
          
            1  2
  Diseased  0 20
  Healthy  20  0

Well, that settles it. The differences between healthy and diseased patients were enough for them to be clustered separately from each other, with no overlap at all.

Interestingly, the first principal component appears to separate the clusters best. The two 3D plots below show the clusters against PC1, PC2, and PC3, and then against PC2, PC3, and PC4. The clusters separate nicely on PC1 in the first plot, but become mixed without any obvious pattern in the second. Whatever is in PC1 must be important to the differences between healthy and diseased patients.

plot_ly(x=comp$PC1, y=comp$PC2, z=comp$PC3, type="scatter3d", mode="markers", color=hc.clusters)
plot_ly(x=comp$PC2, y=comp$PC3, z=comp$PC4, type="scatter3d", mode="markers", color=hc.clusters)
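As a toy illustration (simulated data, not the real expression matrix) of how a group difference concentrated in a subset of variables ends up in a single leading component: shift one group on a handful of "genes" and PC1 absorbs the shift.

```r
# 40 "patients" x 50 "genes" (hypothetical); first 20 patients shifted on 10 genes.
set.seed(4)
x <- matrix(rnorm(40 * 50), nrow = 40)
x[1:20, 1:10] <- x[1:20, 1:10] + 6
pc1 <- prcomp(x, scale. = TRUE)$x[, 1]
# Group means on PC1 sit far apart (the sign of PC1 is arbitrary):
tapply(pc1, rep(c("shifted", "control"), each = 20), mean)
```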

2.3.2 Comparing Linkage Types

The code below will perform hierarchical clustering using complete, average, and single linkage. All of the resultant dendrograms show the same clustering pattern - healthy versus diseased. To prove this, see the table created below indicating each patient, their status (healthy or diseased), and their assigned cluster under each linkage strategy. The pattern is the same across the board.

In sum, no matter what kind of linkage we use, the result is the same.

set.seed(4)
hc.complete=hclust(data.dist, method = "complete")
hc.average=hclust(data.dist, method="average")
hc.single=hclust(data.dist, method="single")

par(mfrow=c(1,3))
plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

hc.co<-cutree(hc.complete, 2)
hc.av<-cutree(hc.average, 2)
hc.si<-cutree(hc.single, 2)

pt.type<-data.frame(row(q2)[,1],hc.clusters, hc.co, hc.av, hc.si)%>%
  mutate(patient = case_when(
    (row_number() %in% 1:20)~ "Healthy",
    TRUE~ "Diseased"
  ))

pt.type%>% 
  select(patient,  hc.co, hc.av, hc.si)%>%
   kbl( 
    caption = "<center><strong>Healthy versus Diseased Clustering</strong></center>",
      col.names = c("Patient Type","Complete Linkage", "Average Linkage", "Single Linkage")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Healthy versus Diseased Clustering
Patient Type Complete Linkage Average Linkage Single Linkage
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Healthy 1 1 1
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2
Diseased 2 2 2

2.4 Bonus

Interesting question.

It is clear that PC1 creates the best differentiation between healthy and diseased patients. It seems that assessing each gene’s contribution to PC1 may be a reasonable metric for finding the genes that differ the most between healthy and diseased patients.

Let’s look at the plots below. It appears that there is a small proportion of genes (~100, judging by the second plot) with moderate rotations in PC1, while the remaining genes have small rotations.

fviz_contrib(pr.out, choice = "var", axes = 1, top = 25)

fviz_contrib(pr.out, choice = "var", axes = 1, top = 150)

As a control metric, I created a data frame of the differences in each gene’s average expression in healthy and diseased patients. I then compared these differences to the absolute value of the same gene’s rotation in PC1.

The table below is long (sorry), but I wanted to make the point that there appears to be a relationship between a gene’s rotation in PC1 and the mean difference in its expression between healthy and diseased patients as calculated in the data frame below. Note that both steadily decrease as you move down the table.

q2b<-t(q2)
q2b1<-data.frame(q2b[,1:20])
q2b2<-data.frame(q2b[,21:40])
q2bdiff<-rowMeans(q2b1)-rowMeans(q2b2)

q2bdiff.mean<-data.frame(1:1000, abs(q2bdiff),abs(pr.out$rotation[,1]))
names(q2bdiff.mean)<- c("gene", "mean_diff", "pca_rot")


head(q2bdiff.mean[order(-q2bdiff.mean$pca_rot),],150)%>%
   kbl( 
    caption = "<center><strong>Mean Differences Between Diseased and Healthy and PC1 Rotations</strong></center>",
      col.names = c("Gene","Mean Differences", "PC1 Rotation")) %>%
  kable_classic(full_width = T, html_font = "Cambria")
Mean Differences Between Diseased and Healthy and PC1 Rotations
Gene Mean Differences PC1 Rotation
502 502 2.5444609 0.0948504
589 589 2.3434914 0.0944977
565 565 2.4708205 0.0918382
590 590 2.2053241 0.0917317
600 600 2.7475772 0.0916732
551 551 2.3428449 0.0876836
593 593 2.4020959 0.0875862
538 538 2.3919786 0.0874540
584 584 2.6019855 0.0869086
509 509 2.3531934 0.0866102
511 511 2.3132558 0.0865513
570 570 2.2706461 0.0862646
599 599 2.4138156 0.0862431
535 535 2.2402204 0.0861009
528 528 1.9924967 0.0856143
592 592 2.1842740 0.0855861
566 566 2.1213545 0.0855301
564 564 2.1537022 0.0855207
540 540 2.5451736 0.0855074
508 508 2.0809369 0.0854137
554 554 2.4367177 0.0853939
11 11 2.0940957 0.0852155
548 548 2.2571178 0.0851578
536 536 2.2670467 0.0851402
563 563 1.9205127 0.0849285
588 588 2.0072099 0.0847297
555 555 2.0435628 0.0845421
520 520 1.9154914 0.0843101
522 522 2.1465577 0.0841999
558 558 1.8908142 0.0839016
574 574 2.1891454 0.0836927
559 559 1.9134076 0.0836331
530 530 1.9088348 0.0835904
569 569 1.9761596 0.0834494
514 514 2.0745379 0.0833856
15 15 2.1306259 0.0833707
561 561 2.4004469 0.0831013
513 513 2.4153434 0.0829114
16 16 1.9797135 0.0827456
575 575 2.0524602 0.0825844
505 505 2.1354602 0.0825725
503 503 2.2140494 0.0825674
549 549 2.5507567 0.0820319
582 582 2.4960844 0.0814645
539 539 1.8609756 0.0813224
13 13 2.0229594 0.0813182
526 526 1.9735051 0.0812224
515 515 1.7318145 0.0810760
516 516 2.2567546 0.0810632
562 562 2.4655491 0.0808857
568 568 2.5194183 0.0808201
547 547 1.9936445 0.0804744
527 527 1.9564579 0.0804677
12 12 2.3166049 0.0802807
571 571 2.1393023 0.0802315
534 534 1.9501233 0.0801885
579 579 1.9758579 0.0797707
523 523 2.2825173 0.0795457
541 541 2.0239611 0.0794231
507 507 1.9247132 0.0792445
550 550 1.9036654 0.0791731
18 18 1.8587970 0.0790526
504 504 1.8871367 0.0786392
545 545 2.0909676 0.0785524
529 529 2.3803677 0.0776381
586 586 2.0450671 0.0776283
533 533 2.0288908 0.0775131
544 544 1.8297158 0.0774975
521 521 2.0283827 0.0773548
531 531 1.8268448 0.0772049
595 595 1.8465195 0.0769854
512 512 1.8597888 0.0769505
596 596 2.0054760 0.0767437
517 517 1.9987917 0.0766804
546 546 2.2631532 0.0766770
542 542 1.9397265 0.0765169
587 587 2.0515892 0.0761105
583 583 2.0829797 0.0760889
598 598 1.8171639 0.0758403
591 591 1.9952959 0.0757735
557 557 1.6179701 0.0756431
501 501 1.7050061 0.0749214
580 580 1.5528136 0.0743236
524 524 2.1133090 0.0742861
572 572 2.0012904 0.0742220
20 20 1.9473131 0.0741622
553 553 1.6619197 0.0737646
17 17 1.8163671 0.0735899
532 532 1.6535325 0.0735879
597 597 1.7652763 0.0734921
519 519 1.4478523 0.0732986
567 567 1.9522874 0.0730010
556 556 1.8509786 0.0725488
506 506 1.8641219 0.0724897
577 577 1.8054775 0.0719603
537 537 1.4820046 0.0718737
576 576 1.9730217 0.0718200
560 560 1.8876578 0.0715854
14 14 1.8176984 0.0714753
19 19 1.6468829 0.0706200
552 552 1.6263043 0.0692426
525 525 1.5367276 0.0685103
543 543 1.7907337 0.0682978
581 581 1.5139918 0.0676986
585 585 1.5566643 0.0663186
578 578 1.5436369 0.0662102
594 594 1.4982538 0.0650860
510 510 1.5262318 0.0640798
156 156 1.0931812 0.0623729
518 518 1.5009283 0.0619118
573 573 1.6040790 0.0572774
135 135 0.9453316 0.0562318
967 967 0.9914602 0.0531267
448 448 0.9552971 0.0519146
615 615 1.0829394 0.0510663
853 853 0.9854869 0.0492812
939 939 0.7587715 0.0477184
172 172 0.8246742 0.0471786
243 243 0.8522348 0.0448446
393 393 0.6504216 0.0448403
116 116 0.7591049 0.0447638
457 457 0.7144575 0.0435036
331 331 0.8396971 0.0433897
98 98 0.5569672 0.0429602
715 715 0.7998315 0.0427917
462 462 0.8080792 0.0423691
751 751 0.6027722 0.0423344
389 389 0.6277691 0.0420895
77 77 0.7570385 0.0419401
900 900 0.5912002 0.0412228
860 860 0.6938482 0.0403874
752 752 0.7357372 0.0402600
293 293 0.7302956 0.0400526
802 802 0.6565890 0.0399415
61 61 0.6582499 0.0395674
337 337 0.6959353 0.0391795
768 768 0.7347575 0.0386251
686 686 0.6094919 0.0380952
326 326 0.5900243 0.0380079
346 346 0.5524822 0.0379718
104 104 0.6662910 0.0379223
263 263 0.6930379 0.0377150
142 142 0.7289325 0.0376839
373 373 0.6164435 0.0374662
49 49 0.7397824 0.0372169
680 680 0.6132496 0.0369390
214 214 0.7701767 0.0367594
634 634 0.6234604 0.0366687
852 852 0.6897570 0.0366071
675 675 0.5788977 0.0366056

Lastly, the plots below reinforce the idea that there is an obvious relationship between genes’ PC1 rotations and their mean differences in expression between healthy and diseased patients. Note that both measures appear right-skewed. Thresholds of 1.3 in mean difference and 0.055 in PC1 rotation appear to distinguish the outlying genes from the rest of the pack.

library(patchwork)

p1<-ggplot(q2bdiff.mean, aes(x=mean_diff, y=pca_rot))+
  geom_point()+
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "PC1 Rotation v Mean Difference",
       x = "Mean Difference Between Diseased and Healthy", y = "Absolute Value of PC1 Rotation") +
  geom_vline(xintercept = 1.3)+
  geom_hline(yintercept = 0.055)+
  theme_classic()+
  theme(aspect.ratio = 1)  # applied after the complete theme so it is not overridden

p2<-ggplot(q2bdiff.mean, aes(x=mean_diff, fill=mean_diff>1.3))+
  geom_histogram()+
  guides(fill = "none")+  # guides(fill = FALSE) is deprecated
  theme_bw()+
  labs(title = "Histogram of Mean Differences Between\nHealthy and Diseased Patients", x="Mean Difference (absolute value)", y="Count")

p3<-ggplot(q2bdiff.mean, aes(x=pca_rot, fill = pca_rot>0.055))+
  geom_histogram()+
  guides(fill = "none")+  # guides(fill = FALSE) is deprecated
  theme_bw()+
  labs(title = "Histogram of PC1 Rotations", x="PC1 Rotation (absolute value)", y="Count")

p1 / (p2 + p3)

Using these thresholds, we identified 110 genes that might be of interest. They are displayed below, along with the mean difference in expression between groups and PC1 rotations.

q2bdiff.mean%>%
  filter(pca_rot>0.055 & mean_diff>1.3)%>%
  nrow()
[1] 110
q2bdiff.mean[order(-q2bdiff.mean$pca_rot),]%>%
  filter(pca_rot>0.055 & mean_diff>1.3)%>%
   kbl( 
    caption = "<center><strong>Mean Differences Between Diseased and Healthy and PC1 Rotations</strong></center>",
      col.names = c("Gene","Mean Differences", "PC1 Rotation")) %>%
  kable_classic(full_width = T, html_font = "Cambria")
Mean Differences Between Diseased and Healthy and PC1 Rotations
Gene Mean Differences PC1 Rotation
502 2.544461 0.0948504
589 2.343491 0.0944977
565 2.470820 0.0918382
590 2.205324 0.0917317
600 2.747577 0.0916732
551 2.342845 0.0876836
593 2.402096 0.0875862
538 2.391979 0.0874540
584 2.601986 0.0869086
509 2.353193 0.0866102
511 2.313256 0.0865513
570 2.270646 0.0862646
599 2.413816 0.0862431
535 2.240220 0.0861009
528 1.992497 0.0856143
592 2.184274 0.0855861
566 2.121354 0.0855301
564 2.153702 0.0855207
540 2.545174 0.0855074
508 2.080937 0.0854137
554 2.436718 0.0853939
11 2.094096 0.0852155
548 2.257118 0.0851578
536 2.267047 0.0851402
563 1.920513 0.0849285
588 2.007210 0.0847297
555 2.043563 0.0845421
520 1.915491 0.0843101
522 2.146558 0.0841999
558 1.890814 0.0839016
574 2.189145 0.0836927
559 1.913408 0.0836331
530 1.908835 0.0835904
569 1.976160 0.0834494
514 2.074538 0.0833856
15 2.130626 0.0833707
561 2.400447 0.0831013
513 2.415343 0.0829114
16 1.979713 0.0827456
575 2.052460 0.0825844
505 2.135460 0.0825725
503 2.214049 0.0825674
549 2.550757 0.0820319
582 2.496084 0.0814645
539 1.860976 0.0813224
13 2.022959 0.0813182
526 1.973505 0.0812224
515 1.731815 0.0810760
516 2.256755 0.0810632
562 2.465549 0.0808857
568 2.519418 0.0808201
547 1.993645 0.0804744
527 1.956458 0.0804677
12 2.316605 0.0802807
571 2.139302 0.0802315
534 1.950123 0.0801885
579 1.975858 0.0797707
523 2.282517 0.0795457
541 2.023961 0.0794231
507 1.924713 0.0792445
550 1.903665 0.0791731
18 1.858797 0.0790526
504 1.887137 0.0786392
545 2.090968 0.0785524
529 2.380368 0.0776381
586 2.045067 0.0776283
533 2.028891 0.0775131
544 1.829716 0.0774975
521 2.028383 0.0773548
531 1.826845 0.0772049
595 1.846520 0.0769854
512 1.859789 0.0769505
596 2.005476 0.0767437
517 1.998792 0.0766804
546 2.263153 0.0766770
542 1.939726 0.0765169
587 2.051589 0.0761105
583 2.082980 0.0760889
598 1.817164 0.0758403
591 1.995296 0.0757735
557 1.617970 0.0756431
501 1.705006 0.0749214
580 1.552814 0.0743236
524 2.113309 0.0742861
572 2.001290 0.0742220
20 1.947313 0.0741622
553 1.661920 0.0737646
17 1.816367 0.0735899
532 1.653533 0.0735879
597 1.765276 0.0734921
519 1.447852 0.0732986
567 1.952287 0.0730010
556 1.850979 0.0725488
506 1.864122 0.0724897
577 1.805478 0.0719603
537 1.482005 0.0718737
576 1.973022 0.0718200
560 1.887658 0.0715854
14 1.817698 0.0714753
19 1.646883 0.0706200
552 1.626304 0.0692426
525 1.536728 0.0685103
543 1.790734 0.0682978
581 1.513992 0.0676986
585 1.556664 0.0663186
578 1.543637 0.0662102
594 1.498254 0.0650860
510 1.526232 0.0640798
518 1.500928 0.0619118
573 1.604079 0.0572774